\documentclass[10pt,twocolumn,letterpaper]{article}

\usepackage{cvpr_rebuttal}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{enumitem}


% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}

%%%%%%%%% PAPER ID  - PLEASE UPDATE
\def\cvprPaperID{1352} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

\usepackage{color}
\usepackage{cite}
\usepackage{booktabs}

\definecolor{gray}{rgb}{0.5, 0.5, 0.5}
\definecolor{darkgray}{rgb}{0.4, 0.4, 0.4}
\definecolor{scarlet}{rgb}{1, 0.35, 0.15}
\definecolor{blue}{rgb}{0, 0, 1}
\definecolor{skyblue}{rgb}{0.3, 0.7, 0.9}
\definecolor{darkgreen}{rgb}{0.2, 0.7, 0.1}
\definecolor{darkyellow}{rgb}{1, 0.65, 0}

\providecommand{\q}[1]{{\color{black}\vspace{0.1em}\noindent\textbf{#1}}}
\providecommand{\qs}[1]{\noindent\textit{#1}}
\providecommand{\qss}[1]{\vspace{0em}\noindent\textit{#1}}
\providecommand{\ra}{\hspace{0.1em}\noindent$\rightarrow$~}
\providecommand{\ans}[1]{{\color{darkgray}\ra{#1}}}
\providecommand{\nin}{\noindent}
\providecommand{\cmt}[1]{{\color{scarlet}{#1}}}

\providecommand{\todo}[1]{{\protect\color{red}{\bf [TODO: #1]}}}
\providecommand{\daniel}[1]{{\protect\color{blue}{\bf [Daniel: #1]}}}
\providecommand{\mohammad}[1]{{\protect\color{magenta}{\bf [Mohammad: #1]}}}

\newcommand{\rone}{{\color{red}R1}}
\newcommand{\rthree}{{\color{darkgreen}R3}}
\newcommand{\rfour}{{\color{blue}R4}}


\usepackage{ifthen}
\newboolean{sepline} 
\setboolean{sepline}{true} % boolvar=true or false

%\renewcommand{\baselinestretch}{.93}

\begin{document}

%%%%%%%%% TITLE - PLEASE UPDATE
\title{Rebuttal for IQA: Visual Question Answering in Interactive Environments}  % **** Enter the paper title here

\maketitle
\thispagestyle{empty}


%%%%%%%%% BODY TEXT - ENTER YOUR RESPONSE BELOW

We thank all reviewers (\rone, \rthree, \rfour) for their thoughtful comments. The reviewers have noted that the newly proposed problem of Interactive Question Answering (IQA) is interesting (\rone, \rthree, \rfour) and challenging (\rone, \rthree, \rfour), and the proposed model is novel (\rone) and well suited to the task (\rfour). We address each comment individually.

\rthree's main concerns relate to the paper's organization, and on that basis \rthree has, surprisingly, given us the lowest score, which is inconsistent with the other reviewers. We nevertheless follow \rthree's suggestions to improve the paper's organization (detailed below).

\ifthenelse {\boolean{sepline}}{
\vspace{-0.5em}\noindent\rule[0em]{\linewidth}{1pt}\vspace{-1.5em}
}{}

\noindent \rone-1: \textbf{Motivation behind the score metric:} We designed the score metric as an inverse exponential so that it rewards accurate agents while penalizing agents that take a long time to answer. We chose the constants to balance the score of a random agent that answers immediately against that of an accurate agent that takes its time.
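To make the trade-off concrete, the metric can be sketched as follows; the functional form, the function name, and the constants here are illustrative placeholders, not the paper's actual values:

```python
import math

def iqa_score(correct: bool, steps: int,
              scale: float = 10.0, tau: float = 200.0) -> float:
    """Hypothetical inverse-exponential score: decays with episode length.

    `scale` and `tau` are illustrative constants (not the paper's values);
    they would be tuned so that an accurate-but-slow agent still outscores
    a random agent that answers immediately.
    """
    if not correct:
        return 0.0
    return scale * math.exp(-steps / tau)

# A correct answer after 50 steps still scores higher than one after 400:
assert iqa_score(True, 50) > iqa_score(True, 400) > 0.0
```

Under this shape, lowering `tau` penalizes slow agents more aggressively, which is the knob one would tune to balance the two reference agents described above.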

\noindent \rone-2: \textbf{Seen environments:} \emph{Seen} environments are rooms used during training, but with object placements and corresponding questions that do not appear in the training set. They remain challenging due to the novel object placements, yet are somewhat easier than \emph{Unseen} environments, which consist of never-before-seen rooms. We will add this note to the paper.

\noindent \rone-3: \textbf{Dataset biases:} We took care to construct IQADATA in a balanced way (Line 434), similar to past datasets (e.g., VQA-2.0). We will also add to our accuracy table a baseline that always answers the most common value, thereby exploiting trivial biases. This baseline obtains accuracies of 57, 27, and 52 (compare to Table 2), and nearly all the learned models outperform it by a reasonable margin (see also the supplementary material).

\noindent \rone-4: \textbf{Joint training vs. joint+pre-training:} The full system is trained jointly. However, since the individual tasks learned by the controllers are fairly independent, we are able to pre-train each one separately. Our initial analysis showed that this leads to faster convergence and better accuracy than training end-to-end from scratch.
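As a minimal sketch of this training schedule (names and structure are hypothetical; the real system optimizes neural controllers rather than logging phases):

```python
def pretrain_then_joint(controllers, pretrain_steps=2, joint_steps=3):
    """Return the sequence of (phase, target) training calls:
    each controller is first pre-trained on its own objective,
    then all are fine-tuned jointly end-to-end."""
    schedule = []
    for name in controllers:  # phase 1: independent pre-training
        schedule += [("pretrain", name)] * pretrain_steps
    schedule += [("joint", "all")] * joint_steps  # phase 2: joint training
    return schedule

schedule = pretrain_then_joint(["navigator", "answerer", "planner"])
assert schedule[0] == ("pretrain", "navigator")
assert schedule[-1] == ("joint", "all")
```

The pre-training phase is possible precisely because each controller has its own supervised objective; the joint phase then lets the components adapt to each other.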

\ifthenelse {\boolean{sepline}}{
\vspace{-0.5em}\noindent\rule[0em]{\linewidth}{1pt}\vspace{-1.5em}
}{}

\noindent \rthree-1: \textbf{Organization of the paper:} We have taken several steps to reorganize the paper, enabling us to add detail while trimming redundancies. We will also create a website with the dataset, more details on the model, and will release our code. Our changes include the following:
\begin{itemize}[noitemsep,nolistsep,leftmargin=*]
    \item \textbf{Introduction:} We have reorganized the introduction into an initial section which defines the problem and a \emph{contributions} section which outlines the model and dataset. We have trimmed down redundant information.
    \item \textbf{Related Work:} IQA touches on many important areas of active research, which is why the related work section is quite long. We have trimmed down some explanations of past works to reduce the length of this section.
    \item \textbf{Model:} We moved the details relating to inputs, outputs, and optimization (see \rthree-3) to the ``Planner'' and ``Low level controllers'' sections, and added more detail.
    \item \textbf{Experiments:} We combined tables 2 and 4 into a single table. This includes the most common answer baseline requested by \rone. We have also added more details about the ablation studies and added qualitative results. 
\end{itemize}

\noindent \rthree-2: \textbf{Planner's action space:} The action space consists of 25 navigation request actions (one per cell of the $5{\times}5$ grid of locations in front of the agent), rotate left and right by $90^{\circ}$, look up and down by $30^{\circ}$, open and close, and answer. We will add this to the Planner section.
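For concreteness, this action space can be enumerated as below (the action-name strings are our own illustrative labels); the 25 navigation requests plus the seven other actions give 32 actions in total:

```python
def planner_actions(grid: int = 5):
    """Enumerate the Planner's discrete action space: one navigation
    request per cell of the 5x5 grid in front of the agent, plus
    rotation, looking, object interaction, and answering."""
    nav = [f"nav_{r}_{c}" for r in range(grid) for c in range(grid)]
    other = ["rotate_left_90", "rotate_right_90",
             "look_up_30", "look_down_30",
             "open", "close", "answer"]
    return nav + other

assert len(planner_actions()) == 32  # 25 navigation + 7 other actions
```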

\noindent \rthree-3: \textbf{Each controller's inputs, outputs, optimization criteria:} Most of these details already appear in Section 4.4; however, we will reword that section to make them clearer.
\begin{itemize}[noitemsep,nolistsep,leftmargin=*]
    \item \textbf{Navigator:} \textit{Inputs}: images from the environment. \textit{Outputs}: an occupancy grid for the $5{\times}5$ region directly in front of the agent. These grids are stitched together to create a global occupancy map, which is fed to a shortest-path algorithm. \textit{Optimization}: supervised sigmoid cross-entropy loss.
    \item \textbf{Answerer:} \textit{Inputs}: partially filled semantic spatial maps and the question. \textit{Outputs}: probabilities over the possible answers. \textit{Optimization}: supervised softmax cross-entropy loss.
    \item \textbf{Planner:} \textit{Inputs}: image features, semantic map features, the question, and the previous action. \textit{Outputs}: a probability that each action is doable, a policy (probability distribution) over actions, and a value function. \textit{Optimization}: the viability prediction is learned via a supervised sigmoid cross-entropy loss, while the policy and value functions are learned with the A3C algorithm, with rewards for exploration and answering correctly and penalties for each timestep as well as for invalid actions.
\end{itemize}
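The Navigator's map-stitching and shortest-path steps can be sketched as follows; the binary-grid representation and the BFS planner are simplifying assumptions for illustration, since the paper does not specify these implementation details:

```python
from collections import deque

def stitch(global_map, local_grid, top_left):
    """Write a predicted local occupancy patch into the global map.
    1 = blocked, 0 = free; later observations overwrite earlier ones."""
    r0, c0 = top_left
    for r, row in enumerate(local_grid):
        for c, occupied in enumerate(row):
            global_map[r0 + r][c0 + c] = occupied
    return global_map

def shortest_path_length(grid, start, goal):
    """BFS over the stitched occupancy map (a stand-in for the
    shortest-path planner); returns step count, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None
```

In this sketch a wall predicted by one local patch forces the planner to route around it, which is the qualitative behavior the Navigator's stitched map enables.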

\ifthenelse {\boolean{sepline}}{
\vspace{-0.5em}\noindent\rule[0em]{\linewidth}{1pt}\vspace{-1.5em}
}{}

\noindent \rfour-1: \textbf{Synthetic data and lack of real-world experiments:} 
While this is a valid concern, it applies to the larger RL community and is not specific to our paper. The RL community tends to use simulated environments because running agents in the real world is prohibitive in terms of cost, scale, and research reproducibility (as stated in the introduction). Our dataset can be easily used, and our models and metrics can be easily reproduced by other researchers. We chose the AI2-THOR environment specifically for its photo-realism, with the eventual goal of training models on synthetic scenes that transfer to the real world; such transfer has been shown in past work [92]. However, it remains a challenge for tasks such as IQA: whereas [92] only required a robot that navigates, we would need a robot that can safely interact with the real world (open fridges, drawers, etc.). Such robots are not available today at a reasonable cost, but one can expect them to be in the future.

%Transfer to real world scenarios poses several challenges and is an interesting future direction.

\newpage

\end{document}
